Abstract

Background: Bald cypress (Taxodium distichum var. distichum) is a coniferous tree of tremendous ecological and
economic importance. It is a member of the family Cupressaceae, which also includes cypresses, redwoods, sequoias,
thujas, and junipers. While the bald cypress genome is more than three times the size of the human genome, its 1C
DNA content is amongst the smallest of any conifer. To learn more about the genome of bald cypress and gain insight
into the evolution of Cupressaceae genomes, we performed a Cot analysis and used Cot filtration to study Taxodium
DNA. Additionally, we constructed a 6.7 genome-equivalent BAC library that we screened with known Taxodium genes
and select repeats.

Results: The bald cypress genome is composed of 90% repetitive DNA with most sequences being found in low to 
mid copy numbers. The most abundant repeats are found in fewer than 25,000 copies per genome. Approximately 
7.4% of the genome is single/low-copy DNA (i.e., sequences found in 1 to 5 copies). Sequencing of highly repetitive Cot 
clones indicates that most Taxodium repeats are highly diverged from previously characterized plant repeat sequences.
The bald cypress BAC library consists of 606,336 clones (average insert size of 113 kb) and collectively provides 6.7-fold
genome equivalent coverage of the bald cypress genome. Macroarray screening with known genes produced, on 
average, about 1.5 positive clones per probe per genome-equivalent. Library screening with Cot-1 DNA revealed that
approximately 83% of BAC clones contain repetitive sequences iterated hundreds to thousands of times per genome.

Conclusions: The BAC library for bald cypress is the first to be generated for a conifer species outside of the family 
Pinaceae. The Taxodium BAC library was shown to be useful in gene isolation and genome characterization and
should be an important tool in gymnosperm comparative genomics, physical mapping, genome sequencing, and
gene/polymorphism discovery. The single/low-copy (SL) component of bald cypress is 4.6 times the size of the
Arabidopsis genome. As suggested for other gymnosperms, the large amount of SL DNA in Taxodium is likely the
result of divergence among ancient repeat copies and gene/pseudogene duplication. 



Background 

The conifer family Cupressaceae contains many remark- 
able and important trees including junipers, redwoods, 
sequoias, cypresses, and thujas [1]. One Cupressaceae spe- 
cies that is of tremendous ecological importance to the 
southeastern U.S. is bald cypress, Taxodium distichum (L.)
Rich var. distichum [2]. Bald cypress is the cornerstone 
species in the aptly named "cypress swamps" where it 
serves as a source of food and shelter for numerous and 
sundry organisms [3]. Though native to the U.S. South, 
bald cypress is a popular ornamental throughout much of 
the world; indeed, it has been cultivated in Europe since at
least the mid-17th century [4]. Bald cypress wood is extremely
resistant to wind, water, pathogens, and pests, something
that perhaps is not surprising when one considers
that individual trees may spend their entire life (sometimes 
> 1500 years - [5]) partially submerged in water. The 
highly durable wood of bald cypress is used in construc- 
tion of boats, docks, bridges, and roofing shingles [6], 
although the tree's relatively slow growth-rate has limited 
its use as a wood crop. Unlike most conifers, bald cypress 
is deciduous with leaves that change from light green to 
brown in the fall. Its attractive appearance and hardiness 
have made it a popular ornamental throughout the eastern 
U.S. [2]. 

The genus Taxodium consists of one to three extant 
species, depending upon taxonomic preference. The 
most conservative treatment places all trees in a single 
species (T. distichum) with three varieties; specifically
bald cypress (T. distichum var. distichum), pond cypress
(T. distichum var. imbricarium), and Montezuma bald
cypress (T. distichum var. mexicanum). While the single 
species treatment is phylogenetically warranted [7-9], 
sociological reasons have kept the multi-species nomen- 
clature in place - e.g., Montezuma bald cypress is the 
national tree of Mexico [10]. 

The bald cypress 1C DNA content is 9731 Mb [11]
which places it amongst the smallest of conifer gen- 
omes [12]. Bald cypress possesses 2n = 2x = 22 chro- 
mosomes [13] and is a diploid like most members of 
the Cupressaceae [12]. 

Taxodium has not been the subject of molecular map- 
ping and/or EST sequencing. In the Cupressaceae, mole- 
cular research has largely focused on Cryptomeria 
japonica, and molecular maps based upon EST, RFLP, 
RAPD, and isozyme markers exist for this species [14-16]. 
However, these genetic maps cover only a small part of 
the entire Cryptomeria genome. 

To advance understanding of Taxodium and the 
Cupressaceae in general, we utilized three experimental 
tools. Specifically: 

(1) Cot analysis - The study of DNA reassociation 
kinetics in solution is known as Cot analysis [17]. It is one 
of the earliest means of studying genome structure predat- 
ing cloning and DNA sequencing techniques by several 
years (see [18] for review). Cot analysis is based upon the
observation that the product of DNA concentration (C0),
reassociation time (t), and a "buffer factor" accounting for
cation concentration (S) has a predictable effect on the
amount of reassociation occurring in a denatured DNA 
sample [18]. The major unknown factor influencing reas- 
sociation is the underlying sequence composition of the 
DNA. Consequently, one can indirectly study genome 
sequence composition by exploring how changes in C0tS
(known by the colloquialism "Cot") influence reassocia- 
tion. Typically, a graph is created in which the fraction of 
reassociated DNA is plotted against the logarithm of Cot 
(from Cot ~ 0 to Cot values at which reassociation is 
essentially complete). The resulting scatter plot is analyzed 
using nonlinear regression analysis, and a least-squares 
curve is fit through the data. This graph, known as a Cot 
curve, provides a visual representation of the genome. 
Analysis of Cot data provides the number of kinetic components
in a genome, the reassociation rate (k) of each
component, the fraction of the genome found in each 
component, the kinetic complexity (i.e., estimated 
sequence complexity) of each component, and each component's
average sequence iteration. Additionally, in some
instances the genome size of an organism can also be esti- 
mated through comparison of the k value for the single/ 
low copy component of the organism of interest with the 
k and genome size of E. coli [19].



(2) Cot filtration (CF) - CF represents a merger between 
Cot analysis and high-throughput DNA sequencing 
[19-21]. In short, the results of a Cot curve are used to 
guide fractionation of a genome into its kinetic compo- 
nents, and isolated components are sequenced in full or 
part. The value of CF and other reduced-representation 
sequencing techniques lies in their ability to enrich for 
subsets of genomic DNA of interest [19,22]. Since the 
majority of genes are single/low-copy in nature, CF has 
been used to enrich for gene space including the promo- 
ters and introns missed by cDNA approaches [20,23,24]. 
Alternatively, sequencing of a highly repetitive component 
represents a means of efficiently exploring the repetitive 
landscape of a genome [25] . 

(3) BAC library analysis - Bacterial artificial chromo- 
somes (BACs) have been the most popular large-insert 
cloning vectors for nearly 20 years [26], and ordered BAC 
libraries (i.e., libraries in which each clone is individually
archived) remain highly useful tools in modern genomics 
research [27]. One can efficiently map molecular markers 
to corresponding BACs via multiplex macroarray hybridi- 
zation techniques or by using multiplex PCR strategies 
[28-30]. By combining macroarray/PCR mapping data 
with data from BAC end sequencing, DNA fingerprinting, 
and/or sequencing of BAC pools, one can generate highly 
accurate physical (DNA sequence) maps and identify 
minimum BAC tiling paths representing whole or nearly 
whole chromosomes [27]. Though construction of BAC 
minimum tiling paths will likely become less important as 
DNA sequencing becomes cheaper and faster, BAC 
libraries will likely remain a key means of bridging gaps 
and resolving anomalies in shotgun sequence-based scaf- 
folds [31]. 

The bald cypress BAC library generated in this study 
affords 6.7 genome equivalent coverage of the bald cypress 
genome. Though BAC libraries exist for several Pinaceae 
conifers [31-33], to our knowledge the bald cypress BAC 
library is the first constructed for a Cupressaceae species. 
The library was shown to be useful in gene isolation and 
genome characterization. The Cot and CF-based repeat 
analyses suggest that highly diverged, low-copy repeats 
account for much of bald cypress' genomic DNA. 

Results 

Cot analysis 

The CotQuest [34] nonlinear regression model providing
the best fit of the renaturation kinetics data was a three-component
fit in which outliers had been removed (using
CotQuest's built-in outlier detection) and the reassociation
rate (k) of the slowest reassociating component had been
fixed based upon the genome size of bald cypress. This
best-fit Cot curve is shown in Figure 1, while the major
biological characteristics obtained from curve analysis are
shown in Table 1. Of note, the curve is composed of
highly repetitive (HR), moderately repetitive (MR), and
single/low-copy (SL) components accounting for 47.0,
41.1, and 7.4% of the genome, respectively. Because the
best fit was obtained when the reassociation rate of the SL
component was fixed based upon genome size, the curve
cannot be used to produce an estimate of DNA content.
Assuming that the SL component has a repetition frequency
of 1, the average repetition frequency of the DNA
in other components can be estimated by dividing their k
values by the k value of the SL component [35]. The mean
predicted repetition frequencies of sequences in the MR
and HR components are 61 and 2054, respectively.

Figure 1 Cot curve for bald cypress. (A) A least-squares curve (black line) was fit through the data points (gray circles) using the CotQuest
program of Bunge et al. [34]. The curve consists of highly repetitive (HR), moderately repetitive (MR) and single/low copy (SL) components,
characterized by fast, intermediate, and slow reassociation, respectively. Blue, green, and red diamonds mark the Cot1/2 values of the HR, MR,
and SL components. The brackets centered at a particular Cot1/2 marker show the "two Cot decade region" in which 80% of the sequences in
that component will renature [63]. The blue shaded region at the top of the curve shows the double-stranded HR DNA (plus foldback
sequences) isolated for sequencing. Biological data obtained from curve analysis are shown in Table 1. (B) The predicted individual renaturation
profiles of the HR component, MR component, and SL component are shown.
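As a rough check on the repetition frequencies quoted above, the ratio-of-rates calculation can be sketched in Python. The k values and genome fractions are those listed in Table 1. The curve function assumes ideal second-order kinetics for each component (a common textbook form, which may differ in detail from the model CotQuest actually fits), and all function and variable names are illustrative.

```python
# Reassociation rates k (M^-1 s^-1) and genome fractions as listed in Table 1.
K = {"HR": 0.2157, "MR": 0.00637, "SL": 0.000105}
FRACTION = {"HR": 0.4702, "MR": 0.4114, "SL": 0.0744}
FOLDBACK = 0.0174  # treated here as reassociating essentially instantly


def mean_repetition_frequency(component):
    """Repetition frequency relative to SL, assuming SL copy number is 1."""
    return K[component] / K["SL"]


def fraction_reassociated(cot):
    """Assumed ideal second-order model: each component contributes
    f * k*Cot / (1 + k*Cot); foldback DNA counts as already reassociated."""
    return FOLDBACK + sum(
        FRACTION[c] * K[c] * cot / (1.0 + K[c] * cot) for c in K
    )


if __name__ == "__main__":
    for name in ("HR", "MR"):
        print(name, round(mean_repetition_frequency(name)))
```

Under these assumptions the rounded ratios match the 2054 and 61 reported for the HR and MR components.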



For each component, 80% of the sequences in that
component will reassociate in the "two Cot decade
region" (TCDR) flanking the component's Cot1/2 value,
i.e., if a component's Cot1/2 is j, then 80% of sequences
can be found between 0.1j and 10j. Because k (and
hence Cot1/2, which is the inverse of k) can be directly
related to sequence copy number, one can use the
TCDR to predict the range in sequence iteration for
80% of the sequences in a particular component. For
example, the bald cypress MR component has a mean
repeat frequency of 61 while 80% of the MR sequences
are repeated from 6.1 to 610 times. Likewise, for the HR
component, which has a mean repetition frequency of
2054, 80% of elements in the component have iteration
frequencies between 205.4 and 20,540.

Table 1 Results of bald cypress Cot curve analysis

Component          Fraction of genome   KnCx^a (Mb)   k (M^-1·s^-1)   Cot1/2 (M·s)   MRF^b
HR                 0.4702               0.2126        0.2157          4.64           2054
MR                 0.4114               65.55         0.00637         156.99         61
SL                 0.0744               719.2         0.000105        9523.81        1
Other
  Foldback         0.0174
  Unreassociated   0.0233

^a KnCx = kinetic complexity
^b MRF = mean repetition frequency
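The TCDR arithmetic described above is simple enough to sketch directly. The following Python snippet is a minimal illustration (the function names are hypothetical) that takes a mean repetition frequency, or a reassociation rate k, and returns the corresponding 0.1x to 10x range and Cot1/2 value.

```python
def cot_half(k):
    """Cot1/2 (M·s) is the inverse of the reassociation rate k (M^-1 s^-1)."""
    return 1.0 / k


def tcdr_copy_range(mean_repetition_frequency):
    """Iteration range expected for 80% of a component's sequences:
    one decade below to one decade above the mean repetition frequency."""
    return 0.1 * mean_repetition_frequency, 10 * mean_repetition_frequency


if __name__ == "__main__":
    for name, mrf in (("MR", 61), ("HR", 2054)):
        low, high = tcdr_copy_range(mrf)
        print(f"{name}: {low:.1f} to {high:.0f} copies")
```

The same one-decade scaling applies to Cot1/2 itself, which is why the bracketed regions in Figure 1 are centered on each component's Cot1/2 marker.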



Cot filtration

A total of 743 high-quality HR capillary sequence reads were analyzed
using the Sequence Read Classification Pipeline
(SRCP) of Chouvarine et al. [36]. As shown in Figure 2,
the vast majority of reads (72.49%) showed no obvious (S'
> 60) homology to previously characterized repeat
sequences, gene sequences, and/or each other. Chloroplast
DNA, rDNA, and "Probable Repeats" (i.e., sequences that
appear to be repetitive based upon their relative frequency
in the HR sequence set) each account for about 6-8% of
the HR reads. Only 0.54% of sequenced HR sequences
shared significant sequence identity with known mobile
elements.

The program MUST [37] was used to identify potential
miniature inverted-repeat transposable elements (MITEs)
in the HR sequences. MITEs are non-autonomous DNA
elements characterized by terminal-inverted repeats, target-site
duplications (direct repeats), and an internal
region with no coding sequence [38]. Those potential
MITEs found in sequence reads classified as "Unknown"
or "Probable Repeats" by the SRCP are provided in Additional
file 1. Using the criteria described in the Materials
and Methods, 53 potential MITEs were identified. The
mean MITE length was 246 bp, the mean direct repeat
length was 2.3 bp, and the mean terminal-inverted repeat
length was 8.8 bp.

Figure 2 Categorization of bald cypress HR sequences. The HR
sequence reads were classified using the Sequence Read
Classification Pipeline of Chouvarine et al. [36] with minor
modifications detailed in the Materials and Methods. Note that the
majority of reads (72.49%) show no significant homology to
known gene or repeat sequences, nor are they categorized as
"Probable Repeats" by the SRCP's ab initio repeat analysis functions.



BAC library construction and characterization 

The bald cypress BAC library consists of 606,336 indivi- 
dually-archived clones in 1579 384-well microtiter 
plates. The library was given the designation TDD_Ba in 
accordance with the library naming standards used by 
the Mississippi Genome Exploration Laboratory (MGEL; 
see http://www.mgel.msstate.edu/dna_libs.htm), the
Clemson University Genomics Institute (http://www.genome.clemson.edu),
and the Arizona Genomics Institute
(http://www2.genome.arizona.edu). The library and
its associated products are distributed by MGEL
(http://www.mgel.msstate.edu).

Periodically during BAC library construction, pulsed- 
field gel electrophoresis analysis was used to check the 
insert size of randomly-selected NotI-digested BACs.
Inserts ranged in size from 10 kb to 202 kb (Figure 3A), 
and the mean insert size for the library was 113 kb (Figure 
3B). The origin of the BAC insert DNAs was confirmed by 
Southern hybridization using radiolabeled bald cypress 
genomic DNA as a probe. As shown in Figure 3C, there is 
considerable variation in hybridization intensity between 
BAC inserts that cannot be accounted for by
differences in DNA amount per lane (compare Figures 3B 
and 3C). Those clones with the strongest hybridization 
signals ostensibly contain more repetitive DNA than those 
with weaker signals. 

Approximately 1.8% of clones in the library were false 
positives (i.e., they possessed a vector plasmid but had no 
discernible insert). The proportion of false positive clones 
was greater earlier in library construction - e.g., the first 
660 plates had a false positive percentage of 3.05 while the 
remaining 919 plates had a false positive percentage of 
0.95. The addition of a "pre-electrophoresis" size-selection 
step (see Materials & Methods) was likely responsible for 
the reduced number of false positives in the latter three- 
fifths of the library. 
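The overall false-positive rate follows from a plate-weighted average of the two batches. A minimal Python sketch of that arithmetic (the function name is illustrative, and plate count is used as the weight because every plate holds the same number of clones):

```python
def pooled_percentage(batches):
    """Plate-weighted mean of per-batch percentages.

    batches: iterable of (number_of_plates, percentage) pairs.
    """
    total_plates = sum(n for n, _ in batches)
    return sum(n * pct for n, pct in batches) / total_plates


if __name__ == "__main__":
    # First 660 plates at 3.05%, remaining 919 plates at 0.95%.
    print(round(pooled_percentage([(660, 3.05), (919, 0.95)]), 1))
```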

The level of chloroplast DNA contamination in the 
library was estimated by macroarray hybridization. Four 
loblolly pine {Pinus taeda L.) sequences representing dif- 
ferent regions of the chloroplast genome [31] were used to 
probe five 4 x 4 double-spotted bald cypress macroarrays.
We found that 2348 (approximately 2.5%) of the
92,160 Taxodium clones represented on the macroarrays
contained chloroplast DNA. This is the highest level of 
chloroplast contamination for any plant BAC library we 
have constructed, but still within the range of reported 
values (see reference [31] for discussion). 

While macroarray screening with mitochondrial DNA
was not performed, sequencing of nuclear DNA prepared
using our nuclear DNA isolation protocol suggests that
mitochondrial DNA contamination is 10 to 100 times
less frequent than contamination from chloroplast DNA
(reference [31] and Figure 2).

Figure 3 Characterization of the bald cypress BAC library. (A) Size distribution of BAC inserts in the TDD_Ba library as determined by pulsed-field
gel electrophoresis (PFGE) of NotI-digested BAC clones. (B) Image of a PFGE gel showing NotI-digested BAC clones. The BAC vector is visible
as a 7.5 kb band near the bottom of each sample lane. The New England BioLabs Lambda Ladder is shown in the far right lane. (C) Southern
blot of the gel shown in B hybridized with bald cypress genomic DNA. Note that some clones exhibit strong hybridization (blue arrows) while
others exhibit relatively weak hybridization (red arrows). Clones with strong hybridization signals ostensibly contain more repetitive DNA than
those with weak signals.

After subtracting the fractions of false positive clones
and chloroplast DNA-containing clones, the number of
clones containing bald cypress nuclear DNA was estimated
to be 580,263. Thus the fraction of clones in the
library containing nuclear DNA was 0.957 (i.e., 580,263
÷ 606,336 = 0.957), and the number of nuclear DNA-containing
clones on each macroarray was determined
to be 17,639 (i.e., 18,432 x 0.957). Using the average
insert size of 113 kb, the library contains approximately
6.74 genome equivalents of bald cypress nuclear DNA
[(580,263 x 113 kb) ÷ 9731 Mb = 6.74]. This level of
genome equivalent coverage affords a 99.88% chance
that a particular genomic sequence of interest will be
found at least once in the library [see 39].
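The coverage and probability figures above can be reproduced with a few lines of Python. This is a sketch under one assumption: that the 99.88% value comes from the widely used library-representation formula P = 1 - (1 - f)^N, where f is the fraction of the genome in one insert and N is the number of clones. The text cites reference [39] for this step without stating the formula, so that attribution is a guess, and the function names are illustrative.

```python
import math


def genome_equivalents(n_clones, insert_kb, genome_mb):
    """Library coverage: total cloned DNA divided by genome size."""
    return n_clones * insert_kb / (genome_mb * 1000.0)


def probability_represented(n_clones, insert_kb, genome_mb):
    """Assumed formula P = 1 - (1 - f)^N, with f = insert size / genome size."""
    f = insert_kb / (genome_mb * 1000.0)
    # log1p keeps precision when f is very small.
    return 1.0 - math.exp(n_clones * math.log1p(-f))


if __name__ == "__main__":
    nuclear_clones = 580_263  # after removing false positives and cpDNA clones
    print(round(genome_equivalents(nuclear_clones, 113, 9731), 2))
    print(round(100 * probability_represented(nuclear_clones, 113, 9731), 2))
```

With the values given in the text, both printed numbers agree with the reported 6.74 genome equivalents and 99.88% probability.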

To further assess the quality and the utility of the
library, eight T. distichum var. distichum single-copy
gene sequences, employed previously in molecular phylogenetics
research [40-42], were used to screen five 4 x
4 macroarrays. The five macroarrays collectively represented
1.02 (rounded to 1.0 for simplicity) genome
equivalents of bald cypress nuclear DNA. The gene
sequences (Table 2) were used to design PCR primers
and/or overgo probes for each gene (Additional file 2,
Tables S1 and S2), and macroarrays were separately
screened with a pool containing the radiolabeled overgos
and a pool containing the radiolabeled PCR amplicons.
As shown in Table 3, the average number of positive
clones recognized by a probe was 1.4 for the overgo
pool and 1.5 for the amplicon pool. To verify the utility
of the library in gene isolation, positive clones 1069I8,
1052C3 and 1137O7 were spotted onto nylon membranes
and probed with various combinations of the
overgo probes for the Chi1, Ferr, Pat, HemA and Lcyb
genes. The results of these dot-blot hybridizations
revealed that clones 1069I8, 1052C3, and 1137O7 contained
the Chi1, HemA, and Pat genes (in full or part),
respectively.

Table 2 Bald cypress genes used in macroarray analysis

Gene                                     Gene abbreviation   GenBank Accession
Putative ammonium transporter            AMT                 AB211839
Aquaporin                                AQU                 AB211841
Calmodulin                               Cal                 AB211840
Pollen allergen                          Cryj2               AB211842
Class 1 chitinase                        Chi1                AB096607
Ferredoxin                               Ferr                AB096608
Glutamyl-tRNA reductase                  HemA                AB161815
Lycopene beta cyclase                    Lcyb                AB096608
Phosphoribosylanthranilate transferase   Pat                 AB161910

Table 3 Interrogation of macroarrays with overgo and PCR amplicon gene-specific probes

# of probes    Genes represented by probes                Probe     # of pos.   Mean hits
in hyb. mix    in hyb. mix                                type^a    clones^b    per genome
5              Chi1, Ferr, Pat, HemA, Lcyb                OVG       7           1.4
               Library addresses of positive clones^c: 1052C3^d, 1069I8^d, 1137O7^d, 1227H24, 1258G13, 1271L2, 1276E22
8              AMT, AQU, Cal, Chi1, Cryj2, Ferr, Lcyb,    PCR       12          1.5
               HemA
               Library addresses of positive clones^c: 721G13, 724I23, 738H3, 749O14, 775F3, 813M13, 822N2, 826H4, 835L14, 836I14, 864F18, 885D20

^a OVG = overgo (22-mer); PCR = PCR amplicon (each ca. 500 bp)
^b Per one genome equivalent, i.e., per 73,728 clones
^c Clone addresses are composed of a plate number, a row letter, and a column number, e.g., 1052N22 is the clone found in plate 1052, row N, column 22.
^d A dot-blot experiment verified that clones 1052N22, 1069H17, and 1137O7 contained the HemA, Chi1, and Pat genes, respectively. Specific clone/probe relationships were not determined for the other clones.

Library screening with repetitive sequences of bald 
cypress 

For estimation of repetitive DNA content, blots prepared 
from pulsed-field gels of random NotI-digested BAC
clones were probed with radiolabeled Cot < 1 M·s (i.e.,
Cot-1) DNA. Based on the Cot data, Cot-1 DNA should 
contain sequences repeated, on average, 9523 times. 
Among the 328 clones on the various Southern blots, 
273 showed obvious hybridization signals using the Cot- 
1 probe, i.e., approximately 83.2% of the BAC clones in 
the library appear to have inserts containing repetitive 
elements found hundreds to thousands of times in the 
genome (data not shown). 

To study the distribution of a particular retroelement in
the Taxodium genome, a 4 x 4 macroarray was screened
with an overgo probe (Additional file 2, Table S3)
designed from a bald cypress HR Cot-filtered sequence
[GenBank: ET185333] with high sequence identity (S' =
128) to two Ginkgo biloba copia-like retroelement reverse
transcriptase sequences [GenBank: DQ054445, GenBank:
DQ054446]. Of the 17,639 nuclear DNA-containing clones
on the macroarray, 888 (i.e., 10.1%) exhibited hybridization
to the overgo, which represents a reverse transcriptase
gene we have deemed TdCRT1 for Taxodium distichum
copia-like reverse transcriptase 1 (Figure 4). Densitometer
analysis of the macroarray suggests that TdCRT1 is found
in roughly 23,892 copies in the bald cypress genome. An
initial glance at TdCRT1 hybridization to macroarrays
suggests that the element is found in clusters, i.e., it is not
distributed randomly throughout the genome. To test this
hypothesis, we used the probabilistic "urn model" method
of Holst [43] as detailed in Shan et al. [44]. Using an average
BAC insert size of 113 kb, an estimated 17,639 nuclear
DNA-containing clones per macroarray, and a genome
size of 9731 Mb, we calculated that each macroarray
represents about 0.205X coverage of the Taxodium genome
[i.e., (17,639 x 113 kb)/9731 Mb = 0.205]. Thus one
would predict that one macroarray should contain 4,898
copies of TdCRT1 (i.e., 23,892 copies x 0.205 = 4,898). If
TdCRT1 distribution were indeed random, we would
expect that the distribution of clones lacking TdCRT1 (i.e.,
lacking hybridization signal) would approximate normality.
The mean number of clones expected to lack a TdCRT1-containing
element and the theoretical standard deviation
(SD) can be estimated using Theorem 2 of Holst [43]. Specifically,

mean = N e^{-n p_k} ; SD = [...]

where N is the number of nuclear DNA-containing
clones on the macroarray (i.e., 17,639), n is the expected
number of TdCRT1 copies on the macroarray (i.e.,
4,898), and p_k is the probability of any copy of the element
"falling" within a clone (i.e., 1/17,639). Insertion of
the preceding values into the equations produced a
mean of 13,362 clones per macroarray with a standard
deviation of 26. However, we determined that 16,751 of
the 17,639 clones on the macroarray did not exhibit
TdCRT1 hybridization. Consequently, the observed
number of clones lacking TdCRT1 hybridization is 642
times (16,751/26 = 642) the expected standard deviation
for a normal distribution, strongly supporting the premise
that the distribution of TdCRT1 is not random.

Figure 4 TdCRT1 hybridization to BAC clones. Close-up of a
portion of a bald cypress 4 x 4 macroarray probed with the TdCRT1
overgo. Because each clone is double-spotted on the macroarray,
positive signals are indicated by closely adjoined dots of similar
hybridization intensity (e.g., colored circles). Presumably clones
exhibiting the most intense hybridization (e.g., clone in red circle)
contain the highest number of TdCRT1 sequences while clones with
intermediate or low numbers of TdCRT1 exhibit moderate and low
hybridization, respectively (e.g., clones in green and blue circles).
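The first steps of the urn-model calculation can be sketched in Python as a sanity check on the numbers quoted above. The sketch assumes the expected number of "empty" clones takes the Poisson-limit form N·exp(-n·p_k) with p_k = 1/N, which reproduces the reported mean of 13,362. The standard deviation expression is not reproduced here because its exact form cannot be recovered from the text. All function names are illustrative.

```python
import math


def array_coverage(n_clones, insert_kb, genome_mb):
    """Fraction of one genome represented on a single macroarray."""
    return n_clones * insert_kb / (genome_mb * 1000.0)


def expected_empty_clones(n_clones, n_copies):
    """Expected number of clones receiving no element copy when n_copies
    are dropped at random into n_clones equally likely clones
    (assumed form: N * exp(-n * p_k), with p_k = 1/N)."""
    p_k = 1.0 / n_clones
    return n_clones * math.exp(-n_copies * p_k)


if __name__ == "__main__":
    coverage = round(array_coverage(17_639, 113, 9731), 3)  # rounded as in the text
    copies_on_array = round(23_892 * coverage)
    print(coverage, copies_on_array)
    print(round(expected_empty_clones(17_639, copies_on_array)))
```

The observed count of 16,751 non-hybridizing clones is far above this random-placement expectation, which is the basis for the clustering argument.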

Of note, nine overgos representing common Ty1-copia-like
or Ty3-gypsy-like repeats from loblolly pine
were hybridized to bald cypress macroarrays. None of
the loblolly pine repeat-based overgos hybridized to the
Taxodium filters (data not shown).

Liu et al. BMC Genomics 2011, 12:553
http://www.biomedcentral.com/1471-2164/12/553

Discussion 

The genome of bald cypress, a Cupressaceae conifer, is 
three times the size of the human genome but only half 
the size of the genomes of most Pinaceae conifers [12]. 
Bald cypress appears to be a diploid like its Pinaceae cou- 
sins and, since evidence suggests that conifers do not 
likely require or utilize more genes than other seed plants 
[45], a natural hypothesis is that the bald cypress genome 
contains lots of repetitive, non-coding DNA. Our Cot 
analysis supports this general tenet as the repetitive frac- 
tion of the bald cypress genome is substantial (HR + MR 
= 90.5%). However, the HR and MR components of bald 
cypress have relatively low mean repetition frequencies 
(2054 and 61, respectively). The idea that bald cypress 
repeats tend to be found in low copy numbers is sup- 
ported by analysis of the HR sequences with the de novo 
repeat detection functions of the Sequence Read Classifi- 
cation Pipeline (SRCP) [36]. SRCP analysis reveals that 
only 6.2% of 743 bald cypress HR reads showed enough 
sequence identity with each other to be grouped by the 
SRCP into the "Probable Repeats" category. In compari- 
son, SRCP evaluation of random Sanger sequence reads 
from Sorghum bicolor (sorghum), a plant with a genome 
one-thirteenth the size of bald cypress [46], resulted in 
categorization of 27% of Sorghum reads as "Probable 
Repeats" despite the fact that the Sorghum sequences 
were not specifically enriched for repeats [36]. 

Why the Taxodium genome has so much low-copy 
repetitive DNA is unknown. A reasonable guess is that 
the majority of mobile element amplification events in 
bald cypress occurred tens to hundreds of millions of 
years ago, and since then element proliferation/activity 
has been severely restricted. If mobile element amplifica- 
tion was suppressed long ago, the copies of each repeti- 
tive element would presumably begin to diverge until 
they were eventually no longer readily recognizable as 
being derived from the same source. Indeed, the limited 
studies of mobile elements in conifers do indicate that 
there is considerable sequence divergence among mem- 
bers of retroelement families [31,47,48]; moreover, much 
of the DNA within sequenced regions of conifer genomes 
lacks features characteristic of known mobile elements, 
low-complexity repeats, or genes [31,49]. Since all coni- 
fers possess enormous genomes, it is possible that some 
of the mobile element amplification events that underlie 
the behemoth C-values of conifers occurred prior to the 
divergence of the major conifer families; this idea should 
be much easier to test once genome sequences become 
available for conifers. 

Only 0.54% of reads (i.e., four reads total) showed reasonable
homology (S' = 60) to previously annotated
transposons; top hits were to copia-like retroelements
from Ginkgo biloba [GenBank: DQ054445, GenBank:
DQ054446], Sequoiadendron giganteum [GenBank: AJ290723], and
Picea glauca [GenBank: AF229252], and a gypsy-like element
from Abies veitchii [GenBank: AJ002621]. The HR
read with the highest level of similarity to a previously
characterized retroelement [GenBank: ET185333] was
used to design an overgo probe (see Additional file 2,
Table S3) that was hybridized to bald cypress macroarrays.
The element represented by the overgo, i.e.,
TdCRT1, was estimated to be present in approximately
23,892 copies per 1C genome. As described above, 80%
of HR sequences are present in copy numbers between
205.4 and 20,540. If half the remaining 20% of sequences
in the HR component reassociate after 10·Cot1/2, then
we would predict that the remaining half (i.e., 10%)
would reassociate prior to 0.1·Cot1/2. In other words, we
would expect less than 10% of HR sequences to have an
iteration frequency greater than 20,540. Since TdCRT1
has a frequency of 23,892 copies per genome, it seems
reasonable to assume that few, if any, genome sequences
are more redundant than TdCRT1. Statistical evaluation
of the distribution of TdCRT1 among clones on a macroarray
suggests that it is found in a decidedly non-random
distribution throughout the bald cypress genome. That
one of the most redundant sequences in the 9.7 Gb Taxodium
genome is found in only 23,892 copies and is distributed
in a decidedly non-random fashion initially
struck us as a bit surprising. However, as Baucom et al.
[50] note, "Two of the many misconceptions about TE
(transposable element) properties in higher eukaryotes
are that they are highly repetitive and are randomly scattered
about the genome."

Our previous studies on the loblolly pine genome sug- 
gest that, as in bald cypress, the majority of pine repetitive 
DNA sequences are highly diverged and ostensibly 
ancient. However, in contrast to bald cypress, there appear 
to be at least a few retroelement families in loblolly pine of 
more pronounced conservation (and ostensibly of more 
recent origin). For example, the retroelement IFG7 [51] is 
found in 210,557 copies and accounts for about 5.8% of 
the pine genome [31] while the Athila-like retroelement 
Gymny is found in approximately 21,700 copies and 
accounts for 1.3% of the loblolly pine genome [47]. 
Together, IFG7 and Gymny make up 1.54 Gb of the 
loblolly pine genome, a value 9.8 times that of the Arabi- 
dopsis thaliana genome [52]. As the pine genome is 
roughly twice the size of that of bald cypress, it is possible
that the activity of IFG7, Gymny, and other mobile element
families is, in part, responsible for the larger genome
of loblolly pine (and other Pinaceae conifers)
compared to bald cypress.

The SL component of bald cypress accounts for 7.4% of 
its genome. This is an amount 4.6 times the size of the 
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Arabidopsis genome. Because conifers appear to have 
functional gene numbers similar to those of diploid 
angiosperms [45], it is possible that much of the SL DNA 
of Taxodium is composed of repetitive elements that 
have diverged to the point that they are now "single- 
copy" in nature, and indeed, studies on other conifers 
have suggested that the high sequence complexity of 
their low-copy DNA is likely due, in part, to repeat diver- 
gence [31,47,48]. Duplication of genes and/or pseudo- 
gene production may also contribute to the size of the SL 
component. It is interesting to note that the macroarray 
hybridization results using probes for Taxodium "single- 
copy" genes (Table 2) indicate that there are 1.4-1.5 posi-
tive clone hits per probe per estimated 1C genome
equivalent. This result could indicate that BAC library
coverage is actually greater than what we calculated 
based on insert size and clone number or that the gen- 
ome size of bald cypress is actually smaller than pre- 
viously reported. Alternatively, some of our probes may 
be hybridizing to multiple loci. Because our analysis was 
based upon macroarray hybridization and not PCR, it is
also possible that some of the amplicon and/or overgo 
probes we developed contain regions with significant 
sequence identity to loci not amplified using the primers 
from the previous phylogenetic studies. Such loci could be 
paralogous genes or pseudogenes. Of note, pseudogenes 
have been described in a number of Pinaceae conifers 
including pines [49,53-56], larches [57], and spruces 
[58,59]. In a recent examination of ten BAC sequences
from loblolly pine, Kovach et al. [49] found that pseudo- 
genes were five times more common than genes with 
potential protein coding functions - whether such a high 
pseudogene level holds for the genome as a whole is 
unknown. To our knowledge, pseudogenes have not been 
reported in Cupressaceae conifers, although the amount
of sequence information for Cupressaceae is minute com-
pared to that for Pinaceae. As a means of exploring
whether pseudogenes are common in bald cypress, we 
plan to sequence Taxodium BAC clones including those
recognized by the probes listed in Table 2. 
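
The two numerical arguments in this paragraph can be made explicit. The Python sketch below is a back-of-the-envelope illustration, not part of the original analysis: the Arabidopsis genome size (0.157 Gb) is an assumption, the function names are hypothetical, and the calculation assumes that every probe detects exactly one locus. Under that assumption, a ratio of 1.4-1.5 hits per genome equivalent implies either a higher true library coverage or a smaller genome.

```python
# Values quoted in the paper
REPORTED_COVERAGE = 6.7   # genome equivalents, from clone number and insert size
GENOME_SIZE_GB = 9.7      # 1C genome size used for the coverage estimate
SL_FRACTION = 0.074       # single/low-copy share of the genome

# Assumption: approximate Arabidopsis thaliana genome size (Gb)
ARABIDOPSIS_GB = 0.157


def implied_coverage(hits_per_equivalent, coverage=REPORTED_COVERAGE):
    """True coverage if the genome size is right and probes are single-locus."""
    return hits_per_equivalent * coverage


def implied_genome_size_gb(hits_per_equivalent, genome_gb=GENOME_SIZE_GB):
    """True genome size if the coverage inputs are right and probes are single-locus."""
    return genome_gb / hits_per_equivalent


sl_gb = SL_FRACTION * GENOME_SIZE_GB
print(f"SL component: {sl_gb:.2f} Gb, {sl_gb / ARABIDOPSIS_GB:.1f}x Arabidopsis")

for ratio in (1.4, 1.5):
    print(
        f"hits/equivalent = {ratio}: coverage ~ {implied_coverage(ratio):.1f}x "
        f"or genome ~ {implied_genome_size_gb(ratio):.1f} Gb"
    )
```

The third explanation offered in the text, hybridization of probes to paralogs or pseudogenes, would raise the hit ratio without changing either coverage or genome size, so the two figures printed here are upper and lower bounds only under the single-locus assumption.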

Hybridization of loblolly pine retroelement sequence 
probes to macroarrays of Taxodium produced no positive 
signals. This is not particularly surprising as fossil evidence 
indicates that the Pinaceae and the Cupressaceae have
been separate lineages for roughly 250 million years [1]. 

Analysis of the bald cypress HR sequences with MUST
[37] resulted in detection of 53 potential MITEs (Addi- 
tional file 1). None of the MITEs shared terminal- 
inverted repeats, and hence each potential MITE was 
considered a member of its own MITE family. Further 
analysis of the potential MITEs will be facilitated by 
sequencing of bald cypress BACs and random genomic
DNA. 



Conclusions 

We have explored the genome of bald cypress using sev- 
eral molecular techniques. Of particular note, we have 
generated a BAC library for bald cypress. This is the first
large-insert library for a member of the Cupressaceae and
a logical tool in eventual sequencing and assembly of the 
bald cypress genome. With regard to genome biology, the 
nuclear DNA of bald cypress appears to be largely com- 
posed of relatively low-copy repeats. These low-copy 
sequences may have arisen from ancient mobile element 
amplification events followed by millions of years of 
mobile element quiescence. Sequencing of BACs should
provide key information on gene structure while helping 
to define the roles that gene duplication, pseudogenes, and 
repeat sequence divergence have played in shaping the 
Taxodium genome. 

Methods 

Plant materials and DNA isolation 

Young leaves were collected from a bald cypress tree 
located at 33° 27.2729' N, 88° 47.5898' W on the Missis- 
sippi State University campus, and nuclear DNA was iso- 
lated according to the protocol described in Additional file 
3. The DNA was sonicated into fragments with a mean 
length of 450 bp (Additional file 4), and metal ions were 
removed from the DNA solution using Chelex-100 (Addi-
tional file 5). The fragmented DNA was ethanol precipi-
tated and re-dissolved in 0.5 M sodium phosphate buffer
(Additional file 6) for Cot analysis or 10 mM Tris-HCl
(pH 8.0) for other uses. 

Melting curve and Cot analysis 

DNA preparation and melting analyses were performed 
as described previously [60]. Cot analysis was performed
according to Peterson et al. [20,60]. Cot data were ana-
lyzed using the program CotQuest [34].

Cot filtration and highly repetitive DNA library 
construction 

From the Cot analysis we determined that the Cot1/2
value of the HR component was 4.64 M·s (Table 1). To
isolate the HR component and foldback DNA, a bald
cypress DNA sample was denatured and allowed to reas-
sociate to Cot 6 M·s (rounded up from 4.64 M·s for sim-
plicity). The resulting double-stranded DNA (Cot < 6 M·s)
was isolated using hydroxyapatite chromatography and 
ligated into either the pGEM T-Easy (Promega) or TOPO 
TA (Invitrogen) cloning vector. The resulting recombinant 
molecules were used to transform ElectroMAX DH10B
T1 Phage-Resistant E. coli cells (Invitrogen) according to
Peterson et al. [20]. HR clones were end-sequenced (sin-
gle-pass) using an ABI 3730xl DNA analyzer. We obtained
high-quality reads for 743 clones. The HR clone sequences
were deposited in GenBank (GenBank: ET185231-
ET185973).
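
The choice of reassociation endpoint can be illustrated with the standard ideal second-order rate equation for a single kinetic component, f = 1 - 1/(1 + Cot/Cot1/2). The Python sketch below is a simplified illustration under that assumption; it is not the CotQuest fitting procedure, the function name is hypothetical, and it ignores the contribution of slower-reassociating components and foldback DNA.

```python
def fraction_reassociated(cot, cot_half):
    """Fraction of one kinetic component that is double-stranded at a given Cot.

    Assumes ideal second-order kinetics for a single component:
        f = 1 - 1 / (1 + Cot / Cot_half)
    Both arguments are in M·s.
    """
    if cot < 0 or cot_half <= 0:
        raise ValueError("cot must be >= 0 and cot_half must be > 0")
    return 1.0 - 1.0 / (1.0 + cot / cot_half)


HR_COT_HALF = 4.64   # M·s, Cot1/2 of the HR component quoted in the text
FILTRATION_COT = 6   # M·s, endpoint used for Cot filtration

at_half = fraction_reassociated(HR_COT_HALF, HR_COT_HALF)
at_endpoint = fraction_reassociated(FILTRATION_COT, HR_COT_HALF)

print(f"HR reassociated at Cot1/2: {at_half:.3f}")       # 0.500
print(f"HR reassociated at Cot 6:  {at_endpoint:.3f}")   # 0.564
```

Under this idealization, rounding the endpoint up from 4.64 to 6 M·s raises the reassociated share of the HR component from one half to a little over 56%.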

HR sequence analysis 

The HR sequence reads were analyzed using the Sequence 
Read Classification Pipeline (SRCP) previously described 
in Chouvarine et al. [36]. Additionally, the sequences were 
evaluated using the miniature inverted-repeat transposable 
element (MITE) identification tool MUST [37] with the 
following parameters: minimum TIR length = 8 nt; maxi- 
mum TIR length = 50 nt; minimum DR length = 2 nt; 
maximum DR length = 30 nt; minimum MITE length = 
100 nt; maximum MITE length = 1000 nt; minimum in- 
cluster identity = 1. In Figure 2, potential MITEs detected 
by MUST were only included under the MITE heading if 
they were categorized as either "Unknown" or "Probable 
Repeat" by the SRCP. For a putative MITE in an SRCP- 
classified "Unknown" or "Probable Repeat" read, only the 
length of the MITE itself (TIR and inter-TIR sequence) 
was considered MITE sequence; the remainder of the 
bases in the read were classified based upon the read's 
SRCP categorization. 
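
The base-accounting rule described above can be sketched as follows. This Python example is a hypothetical illustration, not the SRCP or MUST code: the record layout, field names, and example reads are invented for demonstration, and the handling of MITEs found in reads with other SRCP categories (all bases assigned to the SRCP category) is our reading of the rule rather than a documented behavior.

```python
from collections import Counter

# SRCP categories in which a MUST-detected MITE is reported under "MITE"
MITE_ELIGIBLE = {"Unknown", "Probable Repeat"}


def tally_bases(reads):
    """Sum bases per category.

    Each read is a dict with hypothetical keys:
      "length"        - read length in nt
      "srcp_category" - category assigned by the SRCP
      "mite_length"   - TIR + inter-TIR length in nt (0 or absent if no MITE)
    """
    totals = Counter()
    for read in reads:
        category = read["srcp_category"]
        mite_len = read.get("mite_length", 0)
        if mite_len and category in MITE_ELIGIBLE:
            # Only the MITE itself counts as MITE sequence ...
            totals["MITE"] += mite_len
            # ... the rest of the read keeps its SRCP category.
            totals[category] += read["length"] - mite_len
        else:
            totals[category] += read["length"]
    return totals


# Invented example reads, for illustration only
example_reads = [
    {"length": 700, "srcp_category": "Unknown", "mite_length": 250},
    {"length": 650, "srcp_category": "Probable Repeat"},
    {"length": 600, "srcp_category": "Retroelement", "mite_length": 180},
]
totals = tally_bases(example_reads)
print(dict(totals))
```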

BAC library construction and characterization 

Young leaves were obtained from a bald cypress tree 
located at 33° 27.4330' N, 88° 47.3777' W on the Missis- 
sippi State University campus. We have designated this 
tree MSSTATE#1, and rooted cuttings from the tree have 
been sent to the USDA's Southern Institute for Forest 
Genetics (Saucier, MS) for maintenance and distribution. 
A BAC library was produced from MSSTATE#1 leaf 
nuclear DNA according to Peterson et al. [61]. A "pre- 
electrophoresis" step, as detailed in Magbanua et al. [31], 
was added to the BAC library protocol during the size- 
selections preceding the production of plates 661-919. 
Introduction of the "pre-electrophoresis" step appears to 
have led to a decrease in false-positive clones in the latter
60% of the library compared to the first 40% (see Results). 
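
For orientation, library coverage can be estimated from the clone number, mean insert size, and genome size given in the Abstract. The Python sketch below is illustrative and its function names are hypothetical. The raw product comes out slightly above the reported 6.7 genome equivalents; we assume, without confirmation from this section, that the difference reflects a correction for clones lacking genomic inserts. The probability that a given locus is represented uses the common Poisson approximation, 1 - exp(-coverage).

```python
import math

# Values quoted in the Abstract
CLONES = 606_336
MEAN_INSERT_KB = 113
GENOME_SIZE_GB = 9.7
REPORTED_COVERAGE = 6.7


def raw_coverage(clones, insert_kb, genome_gb):
    """Genome equivalents if every clone carried a full-size genomic insert."""
    return clones * insert_kb / (genome_gb * 1e6)  # 1 Gb = 1e6 kb


def prob_locus_present(coverage):
    """Poisson approximation to the chance that a locus is in at least one clone."""
    return 1.0 - math.exp(-coverage)


raw = raw_coverage(CLONES, MEAN_INSERT_KB, GENOME_SIZE_GB)
effective_fraction = REPORTED_COVERAGE / raw

print(f"Raw coverage: {raw:.2f}x")
print(f"Reported / raw coverage: {effective_fraction:.2f}")
print(f"P(locus present) at 6.7x: {prob_locus_present(REPORTED_COVERAGE):.4f}")
```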

Preparation of radiolabeled probes and macroarray 
hybridization were conducted as described in Magbanua 
et al. [31]. Southern hybridization was conducted using 
standard techniques [62]. 

Assessment of repeats in BAC library 

The macroarray image showing TdCRT1 hybridization
was analyzed using an AlphaInnotech densitometer.
Repeat copy numbers were estimated from macroarrays 
as described in Magbanua et al. [31]. 

Additional material 



Additional file 1: Miniature inverted-repeat transposable elements 
detected in bald cypress HR sequences. Cot-filtered, highly repetitive 
(HR) sequences containing MITEs are listed in FASTA format. Each MITE is
represented by a gold highlighted internal sequence flanked on each 
end by a terminal inverted-repeat sequence (green highlight) and a 
target site duplication (direct repeat, yellow highlight). 

Additional file 2: Primers and overgos used in macroarray analysis. 

Additional file 3: Isolation of nuclear DNA from plants. This is the 
standard Mississippi Genome Exploration Laboratory (MGEL) nuclear DNA 
isolation protocol used to obtain the milligram quantities of plant 
nuclear DNA required for Cot analysis and Cot filtration. 

Additional file 4: Shearing DNA into 450 bp fragments using the 
Misonix Sonicator 3000. This is a detailed MGEL protocol describing 
the method used to shear Taxodium distichum nuclear DNA for use in 
Cot analysis and Cot filtration. 

Additional file 5: Removing metal ions from DNA solutions using 
Chelex. This is a detailed MGEL protocol describing removal of metal 
ions from DNA solutions using Chelex. 

Additional file 6: Preparing 0.5 M sodium phosphate buffer (SPB).
This is a detailed MGEL protocol describing how to prepare sodium
phosphate buffer for use in Cot analysis and Cot filtration. 